Making Logistic Regression A Core Data Mining Tool

نویسندگان

  • Paul Komarek
  • Andrew Moore
چکیده

Binary classification is a core data mining task. For large datasets or real-time applications, desirable classifiers are accurate, fast, and automatic (i.e. no parameter tuning). Naive Bayes and decision trees are fast and parameter-free, but their accuracy is often below state-of-the-art. Linear support vector machines (SVM) are fast and have good accuracy, but current implementations are sensitive to the capacity parameter. SVMs with radial basis function kernels are accurate but slow, and have multiple parameters that require tuning. In this paper we demonstrate that a very simple parameter-free implementation of logistic regression (LR) is sufficiently accurate and fast to compete with state-of-the-art binary classifiers on large real-world datasets. The accuracy is comparable to per-dataset tuned linear SVMs and, in higher dimensions, to tuned RBF SVMs. A combination of regularization, truncated-Newton methods, and iteratively re-weighted least squares make this implementation faster than SVMs and relatively insensitive to parameters. Our fitting procedure, TR-IRLS, appears to outperform several common LR fitting procedures in our experiments. TR-IRLS is robust to linear dependencies and scaling problems in the data, and no data preprocessing is necessary. TR-IRLS is easy to implement and can be used anywhere that IRLS is used. Convergence guarantees can be stated for generalized linear models with canonical links.

برای دانلود متن کامل این مقاله و بیش از 32 میلیون مقاله دیگر ابتدا ثبت نام کنید

ثبت نام

اگر عضو سایت هستید لطفا وارد حساب کاربری خود شوید

منابع مشابه

A Study to Improve the Response in Email Campaigning by Comparing Data Mining Segmentation Approaches in Aditi Technologies

Email marketing is increasingly recognized as an effective Internet marketing tool. In this study, a questionnaire is constructed and distributed to a sample of 146 prospects of Aditi Technologies to find the factors associated with higher response rates. The collected data is analyzed using Factor Analysis and the 11 factors, From Line, Subject Line, Personalization of the subject line, Timing...

متن کامل

Data Mining for Decision Making in Direct Marketing: a Bayesian Networks Approach with Evolutionary Programming

Given the explosive growth of customer and transactional information, data mining can potentially discover new knowledge to improve managerial decision making in marketing. This study proposes an innovative approach to data mining using Bayesian Networks and evolutionary programming and applies the methods to direct marketing data. The results suggest that this approach to knowledge discovery c...

متن کامل

Prediction of the main caving span in longwall mining using fuzzy MCDM technique and statistical method

Immediate roof caving in longwall mining is a complex dynamic process, and it is the core of numerous issues and challenges in this method. Hence, a reliable prediction of the strata behavior and its caving potential is imperative in the planning stage of a longwall project. The span of the main caving is the quantitative criterion that represents cavability. In this paper, two approaches are p...

متن کامل

Calculating classifier calibration performance with a custom modification of Weka

Calibration is often overlooked in machine-learning problem-solving approaches, even in situations where an accurate estimation of predicted probabilities, and not only a discrimination between classes, is critical for decision-making. One of the reasons is the lack of readily available open-source software packages which can easily calculate calibration metrics. In order to provide one such to...

متن کامل

Principal Component Analysis as an Integral Part of Data Mining in Health Informatics

Linear and logistic regression are well-known data mining techniques, however, their ability to deal with interdependent variables is limited. Principal component analysis (PCA) is a prevalent data reduction tool that both transforms the data orthogonally and reduces its dimensionality. In this paper we explore an adaptive hybrid approach where PCA can be used in conjunction with logistic regre...

متن کامل

ذخیره در منابع من


  با ذخیره ی این منبع در منابع من، دسترسی به آن را برای استفاده های بعدی آسان تر کنید

برای دانلود متن کامل این مقاله و بیش از 32 میلیون مقاله دیگر ابتدا ثبت نام کنید

ثبت نام

اگر عضو سایت هستید لطفا وارد حساب کاربری خود شوید

عنوان ژورنال:

دوره   شماره 

صفحات  -

تاریخ انتشار 2005